-
Notifications
You must be signed in to change notification settings - Fork 688
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Drop the MEET packet if the link node is in handshake state #1436
Conversation
After valkey-io#1307 got merged, we notice there is a assert happen in setClusterNodeToInboundClusterLink: ``` === ASSERTION FAILED === ==> '!link->node' is not true ``` In valkey-io#778, we will call setClusterNodeToInboundClusterLink to attach the node to the link during the MEET processing, so if we receive a another MEET packet in a short time, the node is still in handshake state, we will meet this assert and crash the server. If the link is bound to a node and the node is in the handshake state, and we receive a MEET packet, it may be that the sender sent multiple MEET packets when reconnecting, and in here we are dropping the MEET. Note that in getNodeFromLinkAndMsg, the node in the handshake state has a random name and not truly "known", so we don't know the sender. Dropping the MEET packet can prevent us from creating a random node, avoid incorrect link binding, and avoid duplicate MEET packet eliminate the handshake state. Signed-off-by: Binbin <[email protected]>
The ci fail link: https://github.com/valkey-io/valkey/actions/runs/12306207562/job/34347454716#step:3:4364 Runing in a loop can trigger the failiure @pieturin could you take a look and see if i do it right? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code flow has become too brittle and we are always in catchup mode. I think we should step back and reconsider the MEET
flow.
Something else we could do, is to avoid sending MEET packets repeatedly. We can introduce a node field like Lines 4978 to 4986 in 5f7fe9e
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm okay merging this and following up as well, given how consistently it causes issues in the CI.
i forget to mention if we double this line, we will get the crash in #778. clusterSendPing(link, nodeInMeetState(node) ? CLUSTERMSG_TYPE_MEET : CLUSTERMSG_TYPE_PING);
+ clusterSendPing(link, nodeInMeetState(node) ? CLUSTERMSG_TYPE_MEET : CLUSTERMSG_TYPE_PING); |
Signed-off-by: Binbin <[email protected]>
i am merging it to unblock the CI, please feel free to update the comment or code if needed in #1441. |
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## unstable #1436 +/- ##
============================================
- Coverage 70.90% 70.79% -0.12%
============================================
Files 119 119
Lines 64631 64691 +60
============================================
- Hits 45828 45799 -29
- Misses 18803 18892 +89
|
100% agreed. The entropy in the |
Dropping newer MEET messages feels counterintuitive. While it may help with the assert, it seems to compromise resilience. In this situation, the node associated with the first MEET is likely stale. Could this lead to the MEET ultimately failing, effectively nullifying the retries on the sender side? |
After #1307 got merged, we notice there is a assert happen in setClusterNodeToInboundClusterLink:
In #778, we will call setClusterNodeToInboundClusterLink to attach the node to the link
during the MEET processing, so if we receive a another MEET packet in a short time, the
node is still in handshake state, we will meet this assert and crash the server.
If the link is bound to a node and the node is in the handshake state, and we receive
a MEET packet, it may be that the sender sent multiple MEET packets so in here we are
dropping the MEET to avoid the assert in setClusterNodeToInboundClusterLink. The assert
will happen if the other sends a MEET packet because it detects that there is no inbound
link, this node creates a new node in HANDSHAKE state (with a random node name), and
respond with a PONG. The other node receives the PONG and removes the CLUSTER_NODE_MEET
flag. This node is supposed to open an outbound connection to the other node in the next
cron cycle, but before this happens, the other node re-sends a MEET on the same link
because it still detects no inbound connection.
Note that in getNodeFromLinkAndMsg, the node in the handshake state has a random name
and not truly "known", so we don't know the sender. Dropping the MEET packet can prevent
us from creating a random node, avoid incorrect link binding, and avoid duplicate MEET
packet eliminate the handshake state.